Data path for scalable matrix node engine with mixed data formats

ABSTRACT

A microprocessor system comprises a matrix computational unit and a control unit. The matrix computational unit includes a plurality of processing elements. The control unit is configured to provide a matrix processor instruction to the matrix computational unit. The matrix processor instruction specifies a floating-point operand formatted using a first floating-point representation format. The matrix computational unit accumulates an intermediate result value calculated using the floating-point operand. The intermediate result value is in a second floating-point representation format.

INCORPORATION BY REFERENCE TO ANY PRIORITY APPLICATIONS

Any and all applications for which a foreign or domestic priority claim is identified in the Application Data Sheet as filed with the present application are hereby incorporated by reference under 37 CFR 1.57. This application hereby incorporates by reference U.S. Pat. App. No. 16/403083 and U.S. Pat. App. No. 16/421225.

BACKGROUND OF THE INVENTION

Field of the Invention

Machine learning training is a data- and computation-intensive operation. The process is tedious and time consuming, requiring both a significant amount of relevant training data and the computing resources to process it. Moreover, the data and computational resources required only increase with the complexity of the problem being solved. To train a machine learning model, high-powered CPUs perform complex matrix operations using the training data to determine appropriate weights. To increase the speed of training, graphics processing units (GPUs) are used as an alternative or in addition to traditional CPUs. GPUs allow some of the training to be parallelized and help to optimize certain math operations. However, GPUs are traditionally designed for processing graphics problems, such as rendering three-dimensional worlds onto two-dimensional displays. When applied to machine learning, GPUs can require significant amounts of power for the amount of computational power they provide. Moreover, the data formats and data pipelines used by GPUs are designed for graphics processing, not for training machine learning models. Therefore, there exists a need for a machine learning training system that is computationally powerful and power efficient. Such a system should support a high data bandwidth to significantly increase the amount of training data that can be processed. Moreover, the data formats and data pipeline should be optimized for the training data and resulting machine learning models.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.

FIG. 1 is a flow diagram illustrating an embodiment of a process for training a machine learning model.

FIG. 2 is a block diagram illustrating an embodiment of a system for training a machine learning model.

FIG. 3 is a block diagram illustrating an embodiment of a node engine for performing matrix computations.

FIG. 4 is a block diagram illustrating embodiments of an 8-bit floating-point format.

FIG. 5 is a block diagram illustrating an embodiment of a 21-bit floating-point format.

FIG. 6 is a flow diagram illustrating an embodiment of a process for performing matrix computations.

FIG. 7 is a flow diagram illustrating an embodiment of a process for performing matrix computations.

FIG. 8 is a flow diagram illustrating an embodiment of a process for performing multiple interleaved matrix computations.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The invention can be implemented in numerous ways, including as a process; an apparatus; a system; a composition of matter; a computer program product embodied on a computer readable storage medium; and/or a processor, such as a processor configured to execute instructions stored on and/or provided by a memory coupled to the processor. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. In general, the order of the steps of disclosed processes may be altered within the scope of the invention. Unless stated otherwise, a component such as a processor or a memory described as being configured to perform a task may be implemented as a general component that is temporarily configured to perform the task at a given time or a specific component that is manufactured to perform the task. As used herein, the term ‘processor’ refers to one or more devices, circuits, and/or processing cores configured to process data, such as computer program instructions.

A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.

A scalable node engine with multiple matrix processors and configurable data formats is disclosed. As a core component of a training platform for machine learning models, node engines can be arranged in a network to perform training for machine learning models. As the computational and data requirements increase, the number of node engines in the network can be increased to handle the additional requirements. The disclosed node engines are highly efficient in terms of performance per mm² per watt compared to traditional CPUs and GPUs tasked with similar workloads. The node engine architecture achieves this performance improvement in part by optimizing the data formats and the data path for a machine learning workload. For example, the node engine includes multiple matrix processors that can each interleave multiple matrix operations. A node engine with a group of eight matrix processors can compute the result of a matrix multiplication every cycle. When stalled waiting for data for a first set of related matrix operations, each matrix processor can interleave a second set of related matrix operations to utilize otherwise idle computational resources. In some embodiments, the matrix operands are stored using a lower-bit floating-point format and the intermediate and final results are calculated using a higher-bit floating-point format. The lower-bit format improves the read data bandwidth of the matrix processor while the higher-bit format preserves accuracy and precision for the matrix result, for example, by preventing the loss of accuracy in quantized results. Different configurable data formats may be selected to specify different data format configurations, for example, to vary the number of bits allocated for mantissa and exponent fields. This allows the data format to be optimized based on the particular matrix operation used for a particular machine learning task. Additionally, the data formats may include a configurable bias for biasing the exponents. This improves the range of the exponents and allows a larger range to be utilized.
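
The benefit of the wider accumulator format can be illustrated numerically. The following sketch is illustrative only and assumes a hypothetical operand format with a 3-bit mantissa, mirroring the 8-bit formats described later; rounding every partial sum to the narrow format eventually stalls the accumulation, while rounding only the operands and accumulating in a wide format preserves the sum.

    # Minimal sketch (not the patented hardware): why operands can be
    # narrow while the accumulator must be wide. Rounding each partial
    # sum to a hypothetical 3-bit mantissa stalls; a wide accumulator
    # keeps accumulating correctly.
    import math

    def round_to_mantissa(x: float, mantissa_bits: int) -> float:
        """Round x to a float with the given number of mantissa bits."""
        if x == 0.0:
            return 0.0
        exp = math.floor(math.log2(abs(x)))
        scale = 2.0 ** (exp - mantissa_bits)
        return round(x / scale) * scale

    products = [0.01] * 1000  # stand-ins for partial products of a dot product

    narrow_acc = 0.0  # accumulate in the narrow 3-bit-mantissa format
    for p in products:
        narrow_acc = round_to_mantissa(narrow_acc + round_to_mantissa(p, 3), 3)

    wide_acc = sum(round_to_mantissa(p, 3) for p in products)  # wide accumulator

    # The narrow accumulator stalls near 0.25 because the increment falls
    # below half its unit in the last place; the wide sum stays near 10.
    print(narrow_acc, wide_acc)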

In some embodiments, the node engines are arranged in a mesh-like network. Each node engine includes a control unit, a memory, registers, multiple matrix processors, and a post-processing unit such as a vector computational unit. The control unit can process customized instructions, including matrix computational instructions directed to one of the multiple matrix processors, and is used to synchronize results between different matrix processors and node engines. Matrix results may be stored in a register file and processed using vector operations by a post-processing unit. The software running the node engines is capable of taking large matrix operations and subdividing the problem. Different sub-components of the problem may be distributed to different node engines and to different matrix processors of each node engine. For example, two large matrices can be sliced such that each slice is optimized to the matrix size of a matrix processor. The slices can then be distributed to different matrix processors of different node engines where matrix multiplication on the slices is performed. The results of each matrix multiplication can be combined to compute the multiplication result of the original larger matrices.
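
A block-decomposition sketch of this slicing, assuming 8×8 tiles sized for a matrix processor; the distribution across node engines is not modeled, only how the tile products recombine into the full result:

    # Illustrative tiling only: cut two large matrices into 8x8 tiles,
    # multiply tile pairs (each fits one matrix processor), and combine
    # the partial products into the full result.
    import numpy as np

    TILE = 8
    A = np.random.rand(16, 16)   # stand-ins for two large matrices
    B = np.random.rand(16, 16)

    C = np.zeros((16, 16))
    for i in range(0, 16, TILE):
        for j in range(0, 16, TILE):
            for k in range(0, 16, TILE):
                # each tile multiply could run on a different processor
                C[i:i+TILE, j:j+TILE] += (
                    A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE])

    assert np.allclose(C, A @ B)   # combined slices equal the full product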

In some embodiments, a microprocessor system comprises a matrix computational unit and a control unit. The matrix computational unit includes one or more processing elements. For example, the matrix computational unit includes a matrix of computational cells for determining the computational results of two elements from two operands. An 8×8 matrix computational unit includes 64 computational cells. Similarly, an M×N matrix computational unit includes M×N computational cells. The matrix computational unit is part of a matrix processor that is controlled via the control unit. In some embodiments, a control unit is configured to provide a matrix processor instruction to the matrix computational unit. For example, the control unit provides a matrix multiplication instruction to a matrix processor for the matrix computation unit to perform. The matrix processor instruction specifies a floating-point operand formatted with an exponent that has been biased with a specified configurable bias. For example, a matrix multiplication instruction specifies two floating-point matrix operands. Each element of the matrix operands is formatted using a specific floating-point format and a configurable exponent bias. Along with the matrix operands, the matrix processor instruction specifies the floating-point format the matrix elements use, such as a format allocating 1-bit for the sign bit, 4-bits for the exponent, 3-bits for the mantissa, and a particular exponent bias. In various embodiments, the bias is configurable by specifying a value corresponding to an exponent bias. In some embodiments, the bias is reconfigurable. For example, a matrix instruction may specify a new bias that is used to reconfigure the configurable bias. In some embodiments, the floating-point format supports denormal numbers to increase the number of values that can be represented.
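
A minimal decoding sketch for such a format, assuming the 1-4-3 layout described above, an IEEE-style denormal convention when the exponent field is zero, and a caller-supplied bias; the function name and conventions are illustrative, not the hardware implementation:

    # Hypothetical decoder for a 1-4-3 8-bit float (1 sign bit, 4 exponent
    # bits, 3 mantissa bits) with a configurable exponent bias.
    def decode_fp8_143(byte: int, bias: int) -> float:
        sign = -1.0 if (byte >> 7) & 0x1 else 1.0
        exp = (byte >> 3) & 0xF   # 4-bit exponent field
        man = byte & 0x7          # 3-bit mantissa field
        if exp == 0:
            # Denormal (assumed convention): no implicit leading one.
            return sign * (man / 8.0) * 2.0 ** (1 - bias)
        return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)

    # The same bit pattern decodes differently under different biases:
    print(decode_fp8_143(0b0_0111_100, bias=7))   # 1.5 * 2^0  = 1.5
    print(decode_fp8_143(0b0_0111_100, bias=15))  # 1.5 * 2^-8 ≈ 0.0059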

In some embodiments, the matrix processor instruction specifies a floating-point operand formatted using a first floating-point representation format. For example, the instruction specifies an 8-bit floating-point format that allocates 4-bits for the exponent, 3-bits for the mantissa, and a single sign bit. The specified format is used for the elements of a matrix operand. The format may be selected to increase the data bandwidth going into the matrix computational unit of the matrix processor. The matrix computational unit accumulates an intermediate result value calculated using the floating-point operand, and the intermediate result value is in a second floating-point representation format. For example, intermediate results use a different floating-point format such as a 21-bit floating-point format. As another example, intermediate results may use a different floating-point format such as a 27-bit or another appropriate floating-point format. The number of bits dedicated to the intermediate results may be selected to prevent the loss of accuracy when quantizing results. A format using a larger number of bits to represent an intermediate result may be selected to prevent overflow and/or underflow errors that could result from using the first floating-point format. The matrix computational unit outputs an accumulated intermediate result as an output formatted in a third floating-point representation format. For example, multiple accumulated intermediate results may be moved from the matrix processor as a matrix result. The result may be outputted using a third format that is compatible with the bus that the matrix processor is connected to. For example, a node engine may utilize internal buses that are 64-bytes wide. The intermediate accumulated results can be output from the matrix computational unit as 16-bit floating-point values, allowing 32 elements to be moved from the matrix processor for each move instruction. An accumulated result with 64 elements can be moved from the matrix processor to a register file of the node engine using two move instructions, with each instruction moving 32 elements. A move high instruction may be used to move the high 32 elements (e.g., elements 32-63) and a move low instruction may be used to move the low 32 elements (e.g., elements 0-31). In some embodiments, the move instructions are non-destructive and do not clear the contents of the source accumulators when moving a value from the source accumulators of a matrix processor to a memory location external to the matrix processor, such as an output array or register.
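
The two-move output scheme can be sketched as follows; the element values are stand-ins for 16-bit result bit patterns, and the instruction names are taken from the text above rather than from any real instruction set:

    # Sketch: 64 accumulated 16-bit results leave the matrix processor
    # over a 64-byte bus as a "move low" (elements 0-31) and a
    # "move high" (elements 32-63). Illustrative only.
    import struct

    accumulators = list(range(64))      # stand-in for a 64-element result

    def move_low(acc):                  # 32 elements * 2 bytes = 64 bytes
        return struct.pack("<32H", *acc[0:32])

    def move_high(acc):
        return struct.pack("<32H", *acc[32:64])

    assert len(move_low(accumulators)) == 64   # one bus width per move
    assert len(move_high(accumulators)) == 64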

FIG. 1 is a flow diagram illustrating an embodiment of a process for training a machine learning model. For example, the process of FIG. 1 can be used to train a model for autonomous or driver assisted driving. As vehicles are driven, such as by a human driver, autonomously, or by a mix of both human and assisted driving, driving data can be captured. The captured data is prepared as training data and used to train a new machine learning model to improve the driving experience. The new driving experience can improve in areas such as safety, efficiency (power, time, etc.), comfort, performance, convenience, etc. Once the new model is trained and validated, the newly trained model is deployed to vehicles where it is used by one or more machine learning networks to implement the improved driving features and functionality. New features can include autonomous or assisted driving features such as autonomous lane changes, autonomous lane merging onto freeways, autonomous exiting of freeways, improved detection of obstacles and road scenarios, and autonomous navigation-based driving, among others. In various embodiments, the machine learning model may be trained on a training platform that utilizes multiple node engines and where each node engine includes multiple matrix processors and configurable data formats.

At 101, data is captured for machine learning training. In some embodiments, as a vehicle is driven, either by a human, an autonomous driving system, or both, data corresponding to vehicle driving is captured. The captured data of vehicle driving conditions may include image sensor data, vehicle operating parameters (e.g., speed, steering, etc.), vehicle type information (e.g., left-hand drive, right-hand drive, vehicle model, etc.), whether autonomous driving is enabled, the time since the last disengagement of autonomous driving, obstacles detected, driving conditions, etc. The data may be captured passively without interfering with the driving of the vehicle and without requiring driver assistance.

In various embodiments, the vehicles may be equipped with different arrangements of sensors to capture different forms of data. In some embodiments, the sensor data may be vision data, ultrasonic data, LiDAR data, or other appropriate sensor data. For example, an image is captured from a high dynamic range forward-facing camera. As another example, ultrasonic data is captured from a side-facing ultrasonic sensor. In some embodiments, a vehicle is affixed with multiple sensors for capturing data. For example, in some embodiments, eight surround cameras are affixed to a vehicle and provide 360 degrees of visibility around the vehicle with a range of up to 250 meters. Different arrangements of camera sensors can include a wide forward camera, a narrow forward camera, a rear view camera, forward looking side cameras, and/or rearward looking side cameras. In some embodiments, additional ultrasonic and/or radar sensors are used to capture surrounding details. For example, twelve ultrasonic sensors may be affixed to the vehicle to detect both hard and soft objects. An additional forward-facing radar can also be utilized to capture data of the surrounding environment. In various embodiments, radar sensors are able to capture surrounding detail despite heavy rain, fog, dust, and other vehicles. The various sensors are used to capture the environment surrounding the vehicle and the captured data is stored for consideration as training data for a deep learning network.

Once captured, the captured data from one or more vehicles is transferred to a machine learning training platform. For example, a vehicle with wireless connectivity, such as a cellular or WiFi connection, can transfer the data wirelessly to a machine learning training platform. As another option, captured data can be downloaded from a vehicle when the vehicle is being serviced by technicians. In various embodiments, the captured data from multiple vehicles, such as a fleet of vehicles, is aggregated at a machine learning training platform and used as at least one of the sources for training data.

At 103, the captured data is prepared for training a machine learning model. The data captured from vehicles at 101 is prepared as training data. In some scenarios, the data is separated into training and validation data. The preparation of the data may include selecting (or culling) the captured data to identify particularly good training data. In some embodiments, the data is annotated to identify features for training. For example, lane markers, traffic lights, traffic signs, vehicles, pedestrians, etc. may be annotated to enhance the usefulness of the training data as part of data preparation. As another example, the data may be converted to different formats or pre-processed as part of the preparation process. In some embodiments, the data may be converted from a source data format to a format compatible with a matrix processor. For example, data captured as fixed-point data may be converted to floating-point data for increased precision.

At 105, a machine learning model is trained. Using the training data prepared at 103, one or more machine learning models are trained. The training may utilize both a training and a validation data set. In some embodiments, the training utilizes a machine learning platform that is made up of multiple node engines and where each node engine includes multiple matrix processors. By utilizing multiple node engines, for example, organized into a mesh or another appropriate architecture, a complex machine learning training problem can be parallelized and performed more quickly and efficiently. Similarly, since each node engine includes multiple matrix processors, each node can perform multiple matrix operations in parallel. In some embodiments, by operating multiple matrix processors in parallel, a node engine can output the result of a matrix multiplication every clock cycle. The delay waiting for data reads is significantly reduced, the delay between matrix multiplication results is significantly reduced, and the performance bandwidth is significantly increased.

The result of the training is one or more trained machine learning models. In some embodiments, multiple models are trained, each for a potentially different neural network. For example, one machine learning model may be trained to utilize as input the sensor data from a forward-facing camera and another model may be trained to utilize as input the sensor data from a side-facing ultrasonic sensor.

At 107, the trained machine learning model is distributed. For example, the trained model is distributed to and installed onto vehicles. The model may be installed via an over-the-air update, by a technician while servicing a vehicle, or by other means. In certain situations, the model is packaged in a data format for easy installation on a vehicle. For example, the model may be compressed to minimize the time and bandwidth required to transmit the model to a vehicle. In some embodiments, multiple models, for example, each for a different neural network engine running on the vehicle, may be packaged together and transmitted as a single package to the vehicle.

At 109, the trained machine learning model is applied. For example, a new model is utilized by a convolutional neural network on the vehicle to process sensor data and to implement autonomous driving or driver assisted features. In some embodiments, more than one model is applied and/or more than one neural network is utilized. For example, on some vehicles, multiple neural networks are utilized to process the different data from different sensors. Once the new model is utilized, data can be captured reflecting the performance of the new model and used for future training. The process of FIG. 1 can be utilized to continuously improve the performance of a machine learning network. In this manner, the processing loops back to 101 where data is captured. The data can be analyzed to identify difficult use cases for the currently deployed model and the corresponding captured data can be utilized for future training.

FIG. 2 is a block diagram illustrating an embodiment of a system for training a machine learning model. Using the training system of FIG. 2, a machine learning model can be trained for implementing autonomous and/or driver assisted driving functionality. In some embodiments, the training system of FIG. 2 is used to perform the process of FIG. 1. In the example shown, the training system utilizes certain training-related sub-systems of vehicle sub-systems 201 located on a vehicle. The training-related sub-systems communicate with the server-side of the training system located in one or more training data centers 221. Vehicle sub-systems 201 includes sensors 203, deep learning network 205, AI processor 207, vehicle control module 209, network interface 211, vehicle data capture system 213, and capture data store 215. Additional vehicle sub-systems may exist, for example, to perform other functionality, but are not shown. Training data center(s) 221 includes training platform 223, training data store 227, and model data store 229. Training platform 223 includes one or more node engines 225. The node engines are connected (e.g., in a mesh-like network) to perform parallelized processing for machine learning training. In some embodiments, training platform 223, training data store 227, and model data store 229 are located in a single data center but may also be distributed or replicated across multiple data centers.

In some embodiments, a vehicle (not shown) includes vehicle sub-systems 201 to implement autonomous and driver-assisted functionality and to capture data that can be used to train one or more machine learning models for implementing and/or improving the functionality and/or new features. In various embodiments, the different vehicle sub-systems may be communicatively connected. For example, sensor data from sensors 203 is fed to vehicle data capture system 213 for storage in capture data store 215. The captured data is sent to training platform 223 via network interface 211. As another example, sensor data from sensors 203 is fed to deep learning network 205 running on AI processor 207. The output of deep learning network 205 running on AI processor 207 is fed to vehicle control module 209. In various embodiments, network interface 211 is a wireless network interface such as one that includes WiFi and/or cellular network connectivity. Network interface 211 is used to communicate with remote servers, to make phone calls, to send and/or receive text messages, to transmit sensor data to training platform 223, etc. In some embodiments, vehicle sub-systems 201 may include additional or fewer sub-systems as appropriate. For example, in some embodiments, an image pre-processor (not shown) is utilized for pre-processing captured sensor data. As another example, in some embodiments, a post-processing component (not shown) is used to perform post-processing on the output of deep learning network 205 before the output is provided to vehicle control module 209. In some embodiments, a trigger classifier component (not shown) is used to identify driving data as potential training data.

In some embodiments, sensors 203 include one or more sensors. The sensors 203 may be affixed to a vehicle, at different locations of the vehicle, and/or oriented in one or more different directions. For example, sensors 203 may be affixed to the front, sides, rear, and/or roof, etc. of the vehicle in forward-facing, rear-facing, side-facing, etc. directions. In some embodiments, sensors 203 may be image sensors such as high dynamic range cameras. In some embodiments, sensors 203 include non-visual sensors. Sensors 203 may include radar, LiDAR, and/or ultrasonic sensors, among others. In certain embodiments, sensors 203 are not mounted to the vehicle with vehicle control module 209. For example, sensors 203 may be mounted on neighboring vehicles and/or affixed to the road or environment and are included as part of a system for capturing sensor data.

In some embodiments, deep learning network 205 is a deep learning network for implementing autonomous vehicle control. For example, deep learning network 205 may be an artificial neural network such as a convolutional neural network (CNN) that is trained using sensor data and whose output is provided to vehicle control module 209. The machine learning model used by deep learning network 205 may be trained using the system of FIG. 2.

In some embodiments, artificial intelligence (AI) processor 207 is a hardware processor for running deep learning network 205. In some embodiments, AI processor 207 is a specialized AI processor for performing inference using a convolutional neural network (CNN) on sensor data. AI processor 207 may be optimized for the bit depth of the sensor data and/or optimized for deep learning operations such as neural network operations including convolution, dot-product, vector, and/or matrix operations, among others. In some embodiments, AI processor 207 is implemented using a graphics processing unit (GPU). In various embodiments, AI processor 207 is coupled to memory that is configured to provide the AI processor with instructions which, when executed, cause the AI processor to perform deep learning analysis on the received input sensor data and to determine a machine learning result used to at least in part autonomously operate a vehicle.

In some embodiments, vehicle control module 209 is utilized to process the output of artificial intelligence (AI) processor 207 and to translate the output into a vehicle control operation. In some embodiments, vehicle control module 209 is utilized to control the vehicle for autonomous driving and can adjust the speed and/or steering of the vehicle. For example, vehicle control module 209 may be used to control a vehicle by braking, steering, changing lanes, accelerating, merging into another lane, etc. In some embodiments, vehicle control module 209 is used to control vehicle lighting such as brake lights, turn signals, headlights, etc. In some embodiments, vehicle control module 209 is used to control vehicle audio conditions such as the vehicle’s sound system, playing audio alerts, enabling a microphone, enabling the horn, etc. In some embodiments, vehicle control module 209 is used to control notification systems, including warning systems to inform the driver and/or passengers of driving events such as a potential collision or the approach of an intended destination. In some embodiments, vehicle control module 209 is used to adjust sensors such as sensors 203 of a vehicle. For example, vehicle control module 209 may be used to change parameters of one or more sensors such as modifying the orientation, changing the output resolution and/or format type, increasing or decreasing the capture rate, adjusting the captured dynamic range, adjusting the focus of a camera, enabling and/or disabling a sensor, etc. In various embodiments, vehicle control module 209 is used to implement self-driving and/or driver-assisted control of a vehicle.

In some embodiments, network interface 211 is a communication interface for sending and/or receiving data including captured sensor data. In various embodiments, network interface 211 includes a cellular or wireless interface for interfacing with remote servers, such as training platform 223, to connect and make voice calls, to send and/or receive text messages, to transmit sensor data, to receive updates to the autonomous driving system including newly trained machine learning models, etc. For example, network interface 211 may be used to receive an update for the instructions and/or operating parameters for sensors 203, deep learning network 205, AI processor 207, vehicle control module 209, and/or vehicle data capture system 213. For example, a machine learning model of deep learning network 205 may be updated using network interface 211. As another example, network interface 211 may be used to update firmware of sensors 203 and/or operating parameters of vehicle data capture system 213 such as filters and/or parameters for determining the type and amount of data to capture.

In some embodiments, vehicle data capture system 213 and capture data store 215 are used for capturing and storing data associated with vehicle driving conditions. The data captured by vehicle data capture system 213 is stored in capture data store 215. Capture data store 215 may be implemented using any appropriate data store such as a hard drive, non-volatile memory, etc. In some embodiments, capture data store 215 is implemented using a database, a file system, or another means for organizing the data. The captured data of vehicle driving conditions may include image sensor data, vehicle operating parameters (e.g., speed, steering, etc.), vehicle type information (e.g., left-hand drive, right-hand drive, vehicle model, etc.), whether autonomous driving is enabled, the time since the last disengagement of autonomous driving, obstacles detected, driving conditions, etc. The data may be captured passively without interfering with the driving of the vehicle and without requiring driver assistance. Data captured by vehicle data capture system 213 includes data captured from sensors 203.

In some embodiments, vehicle data capture system 213 communicates with training platform 223 via network interface 211. Network interface 211 may connect to a wireless network such as a WiFi and/or cellular network. Vehicle data capture system 213 utilizes network interface 211 to transmit captured data stored in capture data store 215 to training platform 223. In some embodiments, network interface 211 is utilized to download a trained machine learning model for installation in deep learning network 205 running on the vehicle.

In the example of FIG. 2, the server-side components of the training system are located in one or more data centers of training data center(s) 221 and include training platform 223, training data store 227, and model data store 229. Training platform 223 includes one or more computer servers for receiving captured data from vehicle data capture system 213. Training platform 223 is communicatively connected to vehicle data capture system 213 via wireless network interface 211 through a computer network, such as a wired or optical network, of training data center(s) 221. Training platform 223 further includes one or more node engines 225. For example, multiple node engines 225 may be connected in a mesh network. Training platform 223 receives captured data from vehicle data capture system 213, processes the data into useable training (and validation) data, and utilizes node engines 225 for training one or more new machine learning models. Training data store 227 is used for storing the received captured data from one or more vehicles. In some embodiments, processed captured data used as training data, including annotated data, is stored in training data store 227. Once training is completed, model data store 229 is used to store the trained machine learning model. For example, different versions of trained machine learning models may be stored in model data store 229 and utilized to determine the relative functionality of the different models and to identify areas of improvement. In some embodiments, one or more data stores are used to implement training data store 227 and model data store 229.

In some embodiments, node engines 225 includes multiple connected nodes that can be used to parallelize computational tasks. Each connected node includes at least one, and possibly more than one, matrix processor. For example, a single node may include eight matrix processors, each capable of determining at least one matrix multiplication result. In some embodiments, a matrix multiplication result takes a single matrix processor at least a minimum number of clock cycles to compute. By scaling each node to include multiple matrix processors, after an initial delay corresponding to the minimum number of clock cycles to compute a matrix multiplication, a node can output the result of one matrix multiplication each clock cycle. For example, in the event a matrix multiplication takes eight clock cycles to complete, after an initial delay of seven clock cycles, a node with eight matrix processors can determine the result of a matrix multiplication every clock cycle. In various embodiments, the throughput is further determined by memory access, including the latency in accessing matrix operands. In various embodiments, the node engines are able to perform matrix computations using a variety of number formats. For example, a node can utilize fixed-point and floating-point number formats. With respect to floating-point formats, the node is configurable to operate in multiple formats such as 8-bit, 16-bit, and 32-bit formats. For each bit-depth, one or more different formats may be selected. Depending on the computational goal, a different format may be used to represent a number value. A format may be selected to allocate more precision to the mantissa of a floating-point number and another format may be selected to allocate more precision to the exponent of a floating-point number. In some embodiments, the floating-point formats utilize a configurable bias to further customize computational operations. The configurability of number formats allows the training system to target different machine learning operations, for example, based on expected input, intermediate, and output values. In various embodiments, the configurability of the node, including support for multiple floating-point formats and floating-point formats using configurable biases, greatly improves the bandwidth and performance for matrix computational operations without sacrificing precision and accuracy. Similarly, the power consumption and efficiency are also significantly improved.

FIG. 3 is a block diagram illustrating an embodiment of a node engine for performing matrix computations. In the example shown, node engine 300 includes control unit 301, memory 303, load registers 305, post-processing unit register file 307, multiplexers 309 and 311, matrix processors 313 and 351-357, output array 315, and post-processing unit 317. In various embodiments, a node engine may include multiple matrix processors to compute multiple matrix operations in parallel. In the example shown, node engine 300 includes eight matrix processors 313 and 351-357. Each matrix processor includes a data input array, a weight input array, multiple output accumulators, and a matrix computational unit. In the example shown, matrix processor 313 includes data input array 321, weight input array 323, and two output accumulators 329 and 331. The data and weight input arrays feed input to matrix computational unit 325. For example, the data in an input array (e.g., data input array 321 and/or weight input array 323) is shifted by a certain number of bytes (e.g., eight bytes) to feed matrix computational unit 325 over multiple cycles (e.g., eight successive cycles). In some embodiments, each matrix processor includes a single data input array and a single weight input array. Matrix computation unit 325 includes a matrix of computational cells such as computational cell 327. An M×N dimension matrix computational unit includes M×N computational cells. Each input array is sized to fit an entire input matrix and each output accumulator is sized to fit an entire matrix result. In some embodiments, the node engine supports multiple floating-point formats including the 8-bit floating-point formats 400 and 410 of FIG. 4 and the 21-bit floating-point format 500 of FIG. 5. In some embodiments, node engine 300 is used to perform the processes of FIGS. 1, 6, 7, and/or 8.

In some embodiments, node engine 300 may include additional components and additional control lines that are not shown. For example, node engine 300 may include additional registers such as scalar registers, one or more memory cache(s), data formatters for formatting values for the matrix processors, and additional control lines from control unit 301 to sub-components such as multiplexers 309 and 311 and matrix processors 351-357, as a few examples. In some embodiments, certain registers (not shown) are dedicated for storing configurable parameters such as number formats and configurable biases for floating-point numbers. In some embodiments, the buses that connect the different components of node engine 300 are wide-data buses. The size of the bus may be selected to optimize for transferring matrix values. For example, the buses may all be 64-bytes wide. This allows an 8×8 matrix of 64 1-byte elements to be transferred from memory, to a register, to the matrix processor, etc., as a contained unit.

In the example shown, control unit 301 is communicatively connected to one or more components of node engine 300 including memory 303, matrix processor 313, output array 315, and post-processing unit 317. Although not shown, control unit 301 is also communicatively connected to each of the remaining matrix processors 351-357. In various embodiments, control unit 301 is used to synchronize the processing of computational operations including matrix operations and post-processing operations (such as vector operations) and/or access of memory and registers. For example, control unit 301 sends signals to matrix processor 313 to schedule a matrix computation instruction and may monitor a ready signal from matrix processor 313 to indicate when a new instruction can be received and/or when a matrix operation has completed and a matrix result is ready.

In some embodiments, memory 303 is a memory module for storing the input operands and output results of matrix computations and post-processing computations. Memory 303 may include one or more caches (not shown). In the example shown, memory 303 is connected to load registers 305, multiplexers 309 and 311, and post-processing unit register file 307. Additional or fewer connections are possible depending on the flexibility needed in storing and retrieving data to and from memory. As shown, data can be transferred between memory and both load registers 305 and post-processing unit register file 307. The connections to the registers allow data values to be quickly stored in a register, for example, as arguments for a matrix or vector computation. Memory 303 is also connected to multiplexers 309 and 311 so that input matrices can be retrieved from memory. In some embodiments, memory access to memory 303 is controlled by a memory arbiter (not shown) to optimize memory requests, for example, by queuing memory requests and prioritizing certain memory reads over others. In some embodiments, memory 303 is static random access memory (SRAM).

In some embodiments, node engine 300 includes registers such as load registers 305 and post-processing unit register file 307. These registers may be used to optimize memory access. As a few examples, the registers may be used to store values retrieved from memory 303, to store values prior to writing the values into memory 303, to store input and output values of a matrix processor, and to store input and output values of a post-processing unit. In some embodiments, post-processing unit register file 307 is a register file for post-processing unit 317 and is compatible with different lane configurations (e.g., 64, 32, and/or 16 lane configurations) of post-processing unit 317. For example, the registers of post-processing unit register file 307 can be addressed using various byte formats such as 1-byte, 2-byte, and 4-byte values. In some embodiments, each register is 64-bytes in size and can store 64 1-byte elements, 32 2-byte elements, or 16 4-byte elements. In various embodiments, the data formats can be configured and include various 8-bit, 16-bit, and 32-bit floating-point formats.

In some embodiments, multiplexers are used to select the source of input operands to a matrix processor. In the example shown, multiplexers 309 and 311 are used to select the source for a data input matrix and a weight input matrix for matrix processor 313. Depending on the control signal received at each multiplexer, data can be sourced from memory 303 or post-processing unit register file 307. In some embodiments, data sourced from memory 303 is retrieved via a register of load registers 305. In some embodiments, multiplexers 309 and 311 are also used to select the data input matrix and weight input matrix for matrix processors 351-357. By offsetting the processing of the multiple matrix processors of a node engine, a single pair of multiplexers is used to select the input for all matrix processors of the node engine. In various embodiments, multiplexers 309 and 311 are used to control which matrix processor receives which matrix operands. Depending on the configuration, a single matrix processor, a subset of all matrix processors, or all matrix processors receive the selected matrix operands. In alternative embodiments, node engine 300 includes additional multiplexers (not shown) dedicated to each of matrix processors 351-357.

In some embodiments, matrix processor 313 receives a matrix operation instruction and performs a matrix computation such as a matrix multiplication. For each matrix instruction, matrix processor 313 stores one or more matrix operands in one or more input arrays. For example, a data matrix is stored in a data input array, such as data input array 321, and a weight matrix is stored in a weight input array, such as weight input array 323. In various embodiments, the matrix operands are a pair of data and weight matrices, a pair of data and gradient matrices, a pair of weight and gradient matrices, or another appropriate pair of matrix operands. In various embodiments, matrix processor 313 is used to compute multiple related matrix computations as part of the process for computing a matrix multiplication of matrices that are too large to fit in input arrays 321 and 323 of matrix processor 313. The results of the related matrix computations are combined as part of the process of computing the matrix multiplication of the larger matrices. In various embodiments, matrix processor 313 interleaves multiple matrix operations (related or not). For example, matrix processor 313 can interleave performing one or more related matrix operations on a first pair of matrices with performing one or more related matrix operations on a second pair of matrices. For example, matrix processor 313 can perform a matrix multiplication on matrices W₁ and D₁ that are part of (e.g., slices of) larger matrices W_(A) and D_(A), respectively, and subsequently perform a matrix multiplication on matrices W₂ and G₂ that are part of (e.g., slices of) larger matrices W_(B) and G_(B), respectively. The matrix multiplication results of matrices W₁ and D₁ are partial results that are used for computing the matrix multiplication of larger matrices W_(A) and D_(A), and the matrix multiplication results of matrices W₂ and G₂ are partial results that are used for computing the matrix multiplication of larger matrices W_(B) and G_(B). The input matrices W₁ and D₁ and input matrices W₂ and G₂ are stored in a pair of weight and data input arrays, such as arrays 321 and 323. In some embodiments, separate output accumulators 329 and 331, respectively, are used to accumulate the intermediate and/or final results of W₁ * D₁ and the intermediate and/or final results of W₂ * G₂. For example, output accumulator 329 is used to accumulate the intermediate and/or final results of the matrix multiplications associated with matrices W₁ and D₁ and output accumulator 331 is used to accumulate the intermediate and/or final results of the matrix multiplications associated with matrices W₂ and G₂.
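
A minimal model of this interleaving, assuming two independent multiply-accumulate streams, each with its own output accumulator (A0 and A1); the class and method names are illustrative, not the hardware design:

    # Sketch: one matrix processor alternates between two streams, each
    # accumulating into its own output accumulator.
    import numpy as np

    class MatrixProcessorSketch:
        def __init__(self, n=8):
            self.acc = {0: np.zeros((n, n)), 1: np.zeros((n, n))}  # A0, A1

        def multiply_accumulate(self, acc_id, w, d):
            self.acc[acc_id] += w @ d   # partial result for one slice pair

    mp = MatrixProcessorSketch()
    w1, d1 = np.ones((8, 8)), np.ones((8, 8))   # slices of W_A and D_A
    w2, g2 = np.eye(8), np.eye(8)               # slices of W_B and G_B

    # Interleave the two streams; each accumulates independently.
    mp.multiply_accumulate(0, w1, d1)
    mp.multiply_accumulate(1, w2, g2)
    mp.multiply_accumulate(0, w1, d1)           # stream 0 resumes unharmed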

In some embodiments, the data input array and weight input array are sized to fit an entire matrix in linearized form. For example, a matrix processor capable of performing a matrix multiplication on two matrices sized M×N and N×O has an input array of size M×N elements and another input array of size N×O elements for receiving the corresponding M×N and N×O input matrices. In some embodiments, a matrix processor performs computations on two 8×8 matrices, and the weight input array and data input array are each sized to receive 64 elements. Similarly, output accumulators are sized to store an entire result matrix. An output accumulator used for storing the result of a matrix multiplication between two matrices sized M×N and N×O is sized to receive M×O elements. In some embodiments, a matrix processor performs computations on two 8×8 matrices and stores the intermediate and final matrix results in an accumulator sized to fit 64 elements corresponding to an 8×8 result matrix.

In the example shown, the input arrays feed matrix computation unit 325. Matrix computation unit 325 is made up of a matrix of computational cells, such as computational cell 327. Each computational cell is a processing element that can receive two operands, one element from each input matrix, and performs a computation, such as a multiplication, on the two input operands. In some embodiments, the computation is a multiplication and addition. For example, the two input elements are multiplied and the result is added to the current result in an accumulator and stored back into the accumulator. In some embodiments, each computational cell, such as computational cell 327, includes an arithmetic logic unit for performing arithmetic logic operations such as a multiply, a divide, an addition, or a subtraction operation. In some embodiments, multiple operations can be performed in the same clock cycle, such as the multiply and add operations needed for performing a partial dot-product. Each computational cell may include an adder, a multiplier, and/or one or more accumulators corresponding to one or more pairs of data and weight input arrays. In some embodiments, each computational cell, such as computational cell 327, includes a floating-point multiplier and one or more accumulators. Although output accumulators 329 and 331 are depicted as separate from computational cell 327 in FIG. 3, in some embodiments, corresponding portions of output accumulators 329 and 331 are integrated into their respective computational cells. For example, the accumulators of each computational cell together make up output accumulators 329 and 331.

In various embodiments, the computational cells of matrix computation unit 325 support floating-point operations such as floating-point multiplications and additions. In various embodiments, each computational cell includes a multiplier and one or more accumulators to perform a multiply and addition operation in a single cycle. Prior to the start of each matrix computation, the designated accumulator may be cleared. During the process of performing a matrix computation, the designated accumulator is used to accumulate and store intermediate results. In some embodiments, matrix processor 313 is an 8×8 matrix processor and matrix computation unit 325 includes 64 computational cells. Each cycle, 128 elements can be loaded into matrix computation unit 325, two input elements as operands for each of the 64 computational cells. Each computational cell also has access to an accumulator value stored in the designated accumulator.

In some embodiments, a matrix multiplication requires multiple clock cycles to complete. For each clock cycle, a single row and a single column are retrieved from the input operands. For example, a row is retrieved from the matrix stored in the data input array and a column is retrieved from the matrix stored in the weight input array. In some embodiments, the data is retrieved by shifting the data in an input array by an entire row or column. Each row and column is a vector and each vector is copied across the entire computational unit. Each row is duplicated “down” the rows of matrix computational unit 325 and each column is duplicated “across” the columns of matrix computational unit 325. For an 8×8 matrix processor, each column of the weight input matrix is 8 elements and each row of the data input matrix is 8 elements. For each pass, a single weight column is duplicated for each of the eight columns of matrix computational unit 325 and a single data row is duplicated for each of the eight rows of matrix computational unit 325. By duplicating the data across and down one row and one column at a time, an 8×8 matrix processor can complete a matrix multiplication in 8 cycles. During each cycle, the intermediate result of multiplication and accumulation is stored in a designated accumulator. By the eighth and final cycle, the final matrix result is stored in the designated accumulator. A matrix processor using different dimensions, for example, 4×4 or 16×16 matrices, can be used with correspondingly sized input arrays, accumulators, and computational cells.
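
This per-cycle dataflow can be modeled as one rank-1 update per cycle: cycle t broadcasts weight column t across the columns and data row t down the rows, and every cell multiply-accumulates its pair. A minimal sketch, assuming an 8×8 unit and ignoring number formats:

    # Software model of the 8-cycle dataflow; illustrative, not the RTL.
    import numpy as np

    def matmul_8_cycles(weight: np.ndarray, data: np.ndarray) -> np.ndarray:
        n = 8
        acc = np.zeros((n, n))             # one accumulator per cell
        for t in range(n):                 # one "cycle" per loop iteration
            w_col = weight[:, t]           # duplicated across the columns
            d_row = data[t, :]             # duplicated down the rows
            acc += np.outer(w_col, d_row)  # every cell: multiply-accumulate
        return acc

    w = np.random.rand(8, 8)
    d = np.random.rand(8, 8)
    assert np.allclose(matmul_8_cycles(w, d), w @ d)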

In some embodiments, the input data elements are 8-bit floating-point values. By utilizing 8-bit values, the bandwidth performance of the matrix processor is significantly improved. By utilizing configurable floating-point values and configurable biases, the precision and accuracy required for machine learning training is retained and bandwidth is increased. Utilizing an 8-bit format, a 64-byte × 64-byte matrix processor can compute a matrix multiplication for two 8×8 matrices (totaling 128 elements). In contrast, using a 32-bit format, a 64-byte × 64-byte matrix processor can compute a matrix multiplication for two 4×4 matrices (totaling only 32 elements). By optimizing the matrix elements using a configurable 8-bit floating-point format, the bandwidth for loading matrix elements into a matrix processor is improved significantly. Power consumption per area is also drastically improved. To prevent overflow and underflow errors, the intermediate and final results stored in the designated accumulator utilize a larger bit format, such as a 21-bit, 27-bit, or another appropriate floating-point format. Using 8-bit elements as input elements and storing the intermediate results using a 21-bit format preserves the precision and accuracy required for training while also maintaining high input bandwidth to the matrix processor. In various embodiments, each output accumulator stores each element of the result matrix using a 21-bit floating-point number, such as format 500 of FIG. 5. In some embodiments, matrix processor 313 is an 8×8 matrix processor that performs matrix operations using 8-bit floating-point input values and computes the intermediate and final matrix results using 21-bit floating-point values. Input arrays are 64 bytes (64 8-bit elements) and output accumulators are 168 bytes (64 21-bit elements). In various embodiments, the output accumulator is designated by the matrix computation instruction. Similarly, the 8-bit floating-point format and exponent bias can be configured by the matrix computation instruction and/or one or more register arguments.
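
As a quick check of the arithmetic above, assuming 64-byte input arrays:

    # 8-bit elements fit a full 8x8 operand per 64-byte array; 32-bit
    # elements only fit a 4x4 operand.
    ARRAY_BYTES = 64
    for bits, name in [(8, "8-bit"), (32, "32-bit")]:
        elements = ARRAY_BYTES // (bits // 8)
        side = int(elements ** 0.5)
        print(f"{name}: {elements} elements per array -> {side}x{side} matrix")
    # 8-bit: 64 elements -> 8x8; 32-bit: 16 elements -> 4x4. Two arrays
    # total 128 and 32 elements respectively, matching the text above.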

In some embodiments, multiple different 8-bit floating-point formats are supported by matrix processor 313. For example, different formats 400 and 410 are supported and can be selected based on the computation task. Each format allocates a different number of bits to represent the exponent and mantissa of a floating-point number. Depending on the use case, one or another format is selected. In the event a high precision number is needed, more bits can be allocated to the mantissa and a format, such as format 400 with more mantissa bits than format 410, is selected. A format with more mantissa bits may be selected for performing gradient descent where very small deltas are required to preserve accuracy. As another example, a format with more mantissa bits may be selected for performing forward propagation to compute a cost function. As another optimization, each floating-point format utilizes a configurable bias. A configurable bias is used to shift the exponent range. For example, without an exponent bias, an exponent represented by 3-bits can specify an exponent value between 2⁰ and 2⁷, inclusive. A bias of 5 shifts the range of the exponents to having an exponent value between 2⁻⁵ and 2⁺², inclusive. As another example, using 4-bits to represent an exponent and a bias of 15 shifts the range of the exponent from between 2⁰ and 2¹⁵, inclusive, to between 2⁻¹⁵ and 2⁰, inclusive. In various embodiments, by optimizing the number of bits for the exponent field and the number of bits for the bias, the range expressed using the exponent and the numeric coverage of the floating-point number can be optimized to preserve accuracy and precision for the expected input and results.
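
The bias arithmetic can be written out directly, assuming the convention of the 3-bit example above (field value minus bias, with the field spanning 0 through 2^k − 1):

    # With k exponent bits, a bias b shifts the representable exponents
    # from 0 .. 2^k - 1 down to -b .. (2^k - 1) - b.
    def exponent_range(exp_bits: int, bias: int):
        return -bias, (2 ** exp_bits - 1) - bias

    print(exponent_range(3, 0))    # (0, 7)    -> 2^0   .. 2^7
    print(exponent_range(3, 5))    # (-5, 2)   -> 2^-5  .. 2^2
    print(exponent_range(4, 15))   # (-15, 0)  -> 2^-15 .. 2^0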

In some embodiments, the floating-point format supports denormal numbers. For example, an exponent field having a value of zero does not require a normalized mantissa with no leading zeros. By supporting denormal numbers, the exponent range and the number of values that can be represented are increased. In various embodiments, each computational cell, such as computational cell 327, includes support for performing floating-point operations using one or more denormal operands.

In some embodiments, the value of the configurable bias is limited by the number of bits used to represent the configurable bias. For example, a 3-bit configurable bias can have eight different values (0 through 7, inclusive). In some embodiments, as an optimization, the values represented by the configurable bias are not consecutive. For example, the eight values represented by a 3-bit configurable bias are not limited to the values 0 through 7. Instead, the biases are selectable from eight different values. For example, a configurable bias can be selected from eight pre-determined values: 1, 3, 5, 7, 9, 11, 15, and 17. In some embodiments, the pre-determined values are determined based on the most useful biases. The pre-determined values may be selected at least in part to maximize the range and minimize the overlap between the ranges for different biases. In some embodiments, the configurable bias is specified by the matrix processor instruction and/or stored in a register (not shown). In some embodiments, the configurable bias is reconfigurable. For example, after performing an arithmetic operation, the configurable bias can be reconfigured to adjust to the new range of the result. In some embodiments, the reconfiguration is specified as part of the computational instruction. For example, the instruction may specify a new bias that is used to reconfigure the configurable bias.
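
One plausible reading of this non-consecutive encoding, sketched below under the assumption that the 3-bit field indexes a table of pre-determined biases (the table repeats the example values from the text; the indexing itself is illustrative):

    # The 3-bit field selects from a table rather than encoding the bias
    # value directly.
    BIAS_TABLE = [1, 3, 5, 7, 9, 11, 15, 17]   # indexed by the 3-bit field

    def bias_from_field(field: int) -> int:
        assert 0 <= field <= 0b111
        return BIAS_TABLE[field]

    print(bias_from_field(0b110))   # field 6 selects a bias of 15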

In some embodiments, the computational cells of the matrix computational unit can be grouped to also support matrix operations for a larger input number format. For example, the computational cells of an 8×8 matrix computational unit that each operate on 8-bit floating-point matrix elements as input can be grouped to perform 4×4 matrix operations using 16-bit floating-point matrix elements as input. In some embodiments, the output accumulators are sized to prevent the loss of accuracy in the quantized result. For example, a 16-bit floating-point format using a single bit for a sign bit, 8-bits for the exponent, 7-bits for the mantissa, and a non-configurable exponent bias utilizes a 27-bit intermediate floating-point format for floating-point results. A 27-bit floating-point format may allocate a single bit for a sign bit, 9-bits for the exponent, and 17-bits for the mantissa. Support for the grouped operation mode makes the matrix computational unit more versatile in part by supporting more operand formats.

In various embodiments, the grouped operation mode performs matrix operations by splitting an input operand into multiple components and providing each split component to a different computational cell of the group. Each split component is represented as a floating-point number and, when added together, the different split components total the original operand. For example, an input operand is split into the most significant bits (i.e., a high component) and the least significant bits (i.e., a low component) of the operand. In various embodiments, the exponent of the high component uses the same exponent value as the input operand whereas the exponent of the low component is adjusted to account for subtracting the most significant bits from the input operand. In some embodiments, the component for the least significant bits is normalized. In some embodiments, a computational cell supports denormal numbers and the component can be represented as a denormal number.

In various embodiments, when performing a multiplication on two input operands using an operand number format twice the size of the computational cell format (e.g., 16-bit floating point operands instead of 8-bit floating point operands), four computational cells are grouped together and each input operand has a corresponding high and low component. The high and low components of each input operand are provided to processing elements by pairing high-high, high-low, low-high, and low-low components and providing the different pairs to different computational cells of the group. At each computational cell of the group, a matrix multiplication is performed and the result stored in an output accumulator associated with the computational cell. In some embodiments, the output accumulator utilizes a floating-point format with a higher number of bits than the original input operand. For example, the output accumulator may utilize 27-bits for 16-bit input operands that do not have a configurable exponent bias. When the output results of the grouped cells are added together, the result is the matrix multiplication of the original input operands. In some embodiments, the results are moved out of the matrix computational unit and added together using a post-processing unit such as a vector computational unit. For example, a floating-point add instruction is used to add the component results to determine a multiplication result. A floating-point vector add instruction can be used to add the components for a vector of results. In various embodiments, the matrix computation unit is matrix computation unit 325 of FIG. 3 and the post-processing unit is post-processing unit 317 of FIG. 3.
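
For a single pair of wide operands, the four grouped partial products can be sketched as follows (illustrative only; the component values below match the split sketch above, and the plain sum stands in for the post-processing unit's vector add):

    def grouped_multiply(a_hi, a_lo, b_hi, b_lo):
        """Sketch of the grouped mode: the four high/low component pairs
        go to four grouped computational cells, and the four partial
        products are added together afterwards."""
        partials = [a_hi * b_hi, a_hi * b_lo, a_lo * b_hi, a_lo * b_lo]
        return sum(partials)

    # (a_hi + a_lo) * (b_hi + b_lo) expands to exactly these four terms.
    a_hi, a_lo = 1.875, 0.1171875   # components of a = 1.9921875
    b_hi, b_lo = 1.875, 0.1171875   # components of b = 1.9921875
    assert grouped_multiply(a_hi, a_lo, b_hi, b_lo) == 1.9921875 * 1.9921875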

In some embodiments, node engine 300 includes multiple matrix processors 313 and 351-357. The functionality and components of matrix processors 351-357 are described with respect to matrix processor 313. In some embodiments, each matrix processor requires at least a minimum number of cycles to complete a matrix multiplication, for example, eight cycles for an 8×8 matrix processor. By incorporating multiple matrix processors in a single node engine, matrix multiplications can be distributed to different matrix processors. The resulting output can be staggered to read a matrix result from a different matrix processor each cycle. For a set of eight 8×8 matrix processors, each matrix processor can output a matrix result every eight cycles. Staggering the processors allows a matrix result every clock cycle from a different processor. In some embodiments, a different sized matrix processor, for example, a 4×4 or a 16×16 processor, can be used. Similarly, a different number of matrix processors can be included in the node engine based on the depth of the matrix processor computation pipeline.

In some embodiments, a matrix instruction specifies a particular matrix operation and a particular matrix processor, designates an accumulator for storing the matrix result, and specifies the location of the matrix operands. The location of the matrix operands may be specified using a register value or a memory address. For example, a matrix instruction may specify a matrix multiplication, matrix multiplication processor 313, output accumulator 329, a register of post-processing unit register file 307, and a memory address of memory 303. In some embodiments, control unit 301 issues matrix instructions. In some embodiments, operations include matrix multiplication, matrix addition, dot-product, matrix inverse, etc. In some configurations, the output accumulators of each matrix processor uniquely identify a matrix processor. By specifying a particular output accumulator as part of the matrix instruction, the matrix processor is inherently selected. For example, using an A0-A11 naming scheme for accumulators, the first and second output accumulators (e.g., A0 and A1) are mapped to matrix processor 313, the third and fourth output accumulators (e.g., A2 and A3) are mapped to matrix processor 351, the fifth and sixth output accumulators (e.g., A4 and A5) are mapped to matrix processor 352, and so forth. In the example, accumulators 329 and 331 are referenced as A0 and A1, respectively. A matrix multiply instruction specifying accumulator A1 is issued to matrix processor 313 since only matrix processor 313 can store results to accumulator A1.
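
Under the A0-A11 naming scheme described above, the processor selection implied by an accumulator name can be sketched as follows (integer division is an assumed encoding for illustration, not the hardware's decoder):

    # Sketch: two accumulators per matrix processor, so the accumulator
    # index named by an instruction implicitly selects the processor.
    ACCUMULATORS_PER_PROCESSOR = 2

    def processor_for_accumulator(acc_index):
        """Map an accumulator index (A0-A11 -> 0-11) to its processor."""
        return acc_index // ACCUMULATORS_PER_PROCESSOR

    assert processor_for_accumulator(1) == 0   # A1 -> first processor
    assert processor_for_accumulator(2) == 1   # A2 -> second processor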

In some embodiments, output array 315 is used to retrieve the results of one or more matrix processors. In some embodiments, output array 315 includes a multiplexer to determine from which matrix processor to load a result into the output array. In some embodiments, the output array is a 64-byte array and requires two move instructions to move a matrix result from a matrix processor into the output array. For example, a matrix result using 21-bit floating-point values requires 168 bytes. Each 21-bit floating-point value is converted during a move command to a 16-bit floating-point value. Using only two move instructions, a result matrix of 64 elements is converted from 64 21-bit to 64 16-bit floating-point values. For example, a move high instruction moves the highest 32 elements into the output array and a move low instruction moves the remaining lowest 32 elements into the output array. In various embodiments, the output array is 64-bytes so the result of the first move is first stored in a register (such as a register of post-processing unit register file 307) before the second move is performed. In various embodiments, the output array is a temporary output array until the values are moved to the memory or register. In some embodiments, the move instructions are non-destructive and do not clear the matrix result from the matrix processor, for example, by clearing the source accumulator.

In some embodiments, post-processing unit 317 is used to perform post-processing such as normalization, scaling, activation functions, pooling, etc. In some embodiments, post-processing unit 317 is a vector computational engine that operates on each element of a vector. The post-processing unit may utilize different number formats such as 1-byte, 2-byte, and 4-byte number formats including floating-point number formats. In some embodiments, the number of lanes of the post-processing unit 317 can be configured. For example, a post-processing unit 317 that takes a 64-byte vector can operate on 64 1-byte elements, 32 2-byte elements, or 16 4-byte elements corresponding to 64, 32, and 16 lane configurations. In the example shown, post-processing unit 317 utilizes post-processing unit register file 307 for retrieving data for input and for storing post-processing results. In some embodiments, additional post-processing units (not shown) may be included in the node engine as necessary to perform additional machine learning functionality.

FIG. 4 is a block diagram illustrating embodiments of an 8-bit floating-point format. In the example shown, 8-bit floating-point formats 400 and 410 are different 8-bit floating-point formats for representing a floating-point number using a sign, mantissa, and exponent. In some embodiments, a node engine such as node engine 300 and a matrix processor such as matrix processor 313 of FIG. 3 utilize 8-bit floating-point formats 400 and 410 for matrix operations. By performing matrix operations using 8-bit floating-point formats, such as formats 400 and 410, instead of a 16-bit, 32-bit, or another floating-point format, the bandwidth of the matrix processor is significantly increased. In some embodiments, the formats 400 and 410 support a configurable bias. The configurable bias allows for a greater range in representing the exponent for improved accuracy while still maintaining the 8-bit data size. In some embodiments, the floating-point formats 400 and 410 support denormal numbers to increase the number of values that can be represented.

In the example shown, 8-bit floating-point format 400 includes a single bit for sign bit 401, 4-bits for exponent 403, and 3-bits for mantissa 405. Sign bit 401, exponent 403, and mantissa 405 take up a total of 8-bits and can be used to represent a floating-point number. Similarly, 8-bit floating-point format 410 includes a single bit for sign bit 411, 5-bits for exponent 413, and 2-bits for mantissa 415. Sign bit 411, exponent 413, and mantissa 415 take up a total of 8-bits and can be used to represent a floating-point number. In some embodiments, a configurable bias is used to bias the exponent. For example, the 4-bit exponent 403 of format 400 allows exponent 403 to have 16 different values (i.e., values 0 through 15, inclusive). Using 4-bits with no bias (or the equivalent of a configurable bias set to zero), exponent 403 can represent an exponent with values 2⁰ through 2¹⁵, corresponding to an exponent field with values 0 and 15, respectively. By using a configurable bias, the range of the exponent can be shifted. For example, using a configurable bias set to a value of 5, exponent 403 can represent an exponent with values 2⁻⁵ through 2¹⁰. In various embodiments, the value of the configurable bias is limited by the number of bits used to represent the configurable bias. For example, a 3-bit configurable bias can have eight different values. In some embodiments, the values represented by the configurable bias are not consecutive. For example, the eight values represented by a 3-bit configurable bias are not limited to the values 0 through 7. Instead, the biases are selectable from 8 different values. For example, a configurable bias can be selected from eight pre-determined values: 1, 3, 5, 7, 9, 11, 15, and 17. In some embodiments, the pre-determined values are determined based on the most useful biases. In some embodiments, the pre-determined values are selected at least in part to maximize the range of the exponent and to minimize the overlap between the ranges for different biases. In some embodiments, the configurable bias is specified by the matrix processor instruction and/or stored in a register (not shown).
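
A software model of decoding format 400 with a configurable bias might look as follows (a sketch only; the bit packing order and the denormal rule for a zero exponent field are assumptions consistent with the description above, not the hardware implementation):

    # Sketch: decode an 8-bit float in format 400 (1 sign bit, 4 exponent
    # bits, 3 mantissa bits) with a configurable exponent bias.
    def decode_fp8(byte, exponent_bits=4, mantissa_bits=3, bias=5):
        sign = -1.0 if (byte >> (exponent_bits + mantissa_bits)) & 1 else 1.0
        exp_field = (byte >> mantissa_bits) & ((1 << exponent_bits) - 1)
        mant_field = byte & ((1 << mantissa_bits) - 1)
        if exp_field == 0:
            # Denormal: no implicit leading one; exponent stays at -bias.
            return sign * (mant_field / (1 << mantissa_bits)) * 2.0 ** -bias
        mantissa = 1.0 + mant_field / (1 << mantissa_bits)
        return sign * mantissa * 2.0 ** (exp_field - bias)

    # With a bias of 5, the largest exponent field (15) encodes 2**10.
    assert decode_fp8(0b0_1111_000) == 2.0 ** 10

The same sketch decodes format 410 by passing exponent_bits=5 and mantissa_bits=2, and, with wider fields, a 21-bit accumulator format.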

In various embodiments, multiple different 8-bit floating-point formats, such as formats 400 and 410, are supported by a matrix processor. By supporting multiple formats, the available precision can be allocated to either the exponent or the mantissa. For example, certain operations such as gradient descent may require additional precision and thus a greater number of bits for the mantissa. As another example, more bits can be used for the mantissa for operations where the values are clustered close together and do not need additional range for exponents. In contrast, for certain operations, the range of values may be greater and a larger range for the exponent is needed. Using format 410, fewer bits are dedicated to the mantissa and more are dedicated to the exponent. In some embodiments, the format is specified by the matrix processor instruction and may be stored in a register (not shown). In various embodiments, additional floating-point formats not depicted may be supported. For example, a 4-bit mantissa and 3-bit exponent format may be supported (not shown).

FIG. 5 is a block diagram illustrating an embodiment of a 21-bit floating-point format. In the example shown, floating-point format 500 is a 21-bit floating-point format for representing a floating-point number using a sign, mantissa, and exponent. In some embodiments, a node engine such as node engine 300 and a matrix processor such as matrix processor 313 of FIG. 3 utilize a 21-bit floating-point format, such as format 500, for certain matrix operations, such as for storing the results (and intermediate results) of matrix multiplications and/or matrix additions. In some embodiments, format 500 is used by accumulators for a matrix processor, such as output accumulators 329 and 331 of FIG. 3. For example, the multiplication result of two 8-bit multiplication operands may cause an overflow or underflow error if the result is limited to the same 8-bit format. Using a format larger than 8-bits for the result prevents overflow and underflow errors. Similarly, using a 21-bit floating-point format to store intermediate and final results when computing matrix multiplication with 8-bit matrix elements prevents overflow or underflow errors. Using a result with a bit-depth smaller than 32-bits increases the efficiency of memory usage. In various embodiments, format 500 with a bit-depth of 21-bits is used to optimize for both memory usage and accuracy. In some embodiments, the format 500 supports a configurable bias. The configurable bias allows for a greater range for improved accuracy while still maintaining the 21-bit data size. In some embodiments, the configurable bias is specified by the matrix processor instruction and/or stored in a register (not shown).

In the example shown, 21-bit floating-point format 500 includes a single bit for sign bit 501, 7-bits for exponent 503, and 13-bits for mantissa 505. Sign bit 501, exponent 503, and mantissa 505 take up a total of 21-bits and can be used to represent a floating-point number. In some embodiments, a configurable bias is used to bias the exponent. For example, the 7-bit exponent 503 of format 500 allows exponent 503 to have 128 different values (i.e., values 0 through 127, inclusive). Using 7-bits with no bias (or the equivalent of a configurable bias set to zero), exponent 503 can represent an exponent with values 2⁰ through 2¹²⁷, corresponding to an exponent field with values 0 and 127, respectively.

In various embodiments, format 500 is used by one or more accumulators, such as output accumulators 329 and 331 of FIG. 3, of a matrix processor for a node engine, such as node engine 300 and matrix processor 313 of FIG. 3. In some embodiments, a register (not shown) is used to store a setting for the configurable bias used for storing a floating-point number in a particular accumulator. In some embodiments, multiple 21-bit formats (e.g., with different allocations of bits for exponent and mantissa fields) may be used and the particular format is specified by the matrix processor instruction. The value for the configurable bias may be specified using the matrix processor instruction and/or stored in a register.

Although FIG. 5 depicts a 21-bit floating-point format that can be used by accumulators for a matrix processor, such as output accumulators 329 and 331 of FIG. 3, formats with alternative bit-depths may be used. For example, depending on the operating requirements, such as requirements for preventing loss of accuracy, a 27-bit floating-point format may be used to prevent the loss of accuracy in quantized results when supporting operations on certain 16-bit floating-point operands. As one example, a 27-bit floating-point format may include a single bit for a sign bit, 9-bits for the exponent, and 17-bits for the mantissa. A 27-bit floating-point format may be used to accumulate multiplication operations on 16-bit floating-point operands. In some embodiments, a 16-bit floating-point operand is represented with a single bit for a sign bit, 8-bits for the exponent, and 7-bits for the mantissa.

FIG. 6 is a flow diagram illustrating an embodiment of a process for performing matrix computations. The process of FIG. 6 is used by a training platform such as training platform 223 of FIG. 2 to perform matrix computations by one or more node engines, such as node engines 225 of FIG. 2 or node engine 300 of FIG. 3. In some embodiments, a training platform receives one or more matrix computation operations and parallelizes the operations across different node engines. Each node engine may then also parallelize its operations across different matrix processors. The results may be combined, as appropriate, at one or more node engines to determine a result, such as a matrix of weights for a machine learning model. In some embodiments, the process of FIG. 6 is performed as part of step 105 of FIG. 1.

At 601, a computational instruction is received. In some embodiments, the computational instruction is received by a training platform such as training platform 223 of FIG. 2. The training platform processes the computational instruction and performs the necessary division and distribution of work to different node engines. For example, a computational instruction requesting a convolution of an image with a filter is received at a server of the training platform, initiating a machine learning training process. In some embodiments, the instruction may include the necessary parameters to perform the computational instruction including the operations involved and the operands. For example, the instruction may include the size of the input operands (e.g., the size of each input matrix), the start address of each input matrix, a stride parameter, a padding parameter, and/or matrix, vector, and/or post-processing commands. For example, a computational instruction may describe an image data size (e.g., 96×96, 1920×1080, etc.) and bit depth (e.g., 8-bits, 16-bits, etc.) and a filter size and bit depth, etc. In many scenarios, the matrices of a matrix computation may be larger than can fit inside a matrix processor, so additional processing may be performed to subdivide the computation so that it can be performed by different node engines or matrix processors.
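
As a concrete picture of the parameters such an instruction might carry, consider the following plain record; every field name, address, and size here is an illustrative assumption, not the actual instruction encoding:

    # Hypothetical parameter record for a convolution request.
    convolution_request = {
        "operation": "conv2d",
        "input_size": (1920, 1080),   # image data size
        "input_bit_depth": 8,
        "filter_size": (3, 3),        # filter size (illustrative)
        "filter_bit_depth": 8,
        "input_address": 0x0000,      # start address of the input matrix
        "filter_address": 0x4000,     # start address of the filter matrix
        "stride": 1,
        "padding": 0,
    }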

At 603, matrix operations and operands are determined. In the event one or more matrices of the computation instruction received at 601 are larger than the input matrices for a matrix processor, the computational instruction of 601 is divided into smaller component operations. At 603, matrix operations and operands corresponding to the smaller component operations are determined and may include slicing, segmenting, or partitioning the original matrix operands into smaller matrices and performing matrix operations on the smaller matrices. The results of the matrix operations on the smaller matrices may be combined to complete the computation instruction received at 601. Different node engines and matrix processors may be assigned to perform different components of the computational instruction. In some embodiments, the elements of the matrix operands may be converted or targeted for conversion to an 8-bit floating-point format. An 8-bit floating-point format, such as format 400 or format 410 of FIG. 4, is used by a node engine to increase the processing and performance bandwidth as well as the power efficiency of the matrix processor. In some embodiments, a configurable bias for a corresponding floating-point format is or will be selected. For example, a format with a high-precision mantissa is selected for performing gradient descent operations.

In various embodiments, a larger matrix is sliced into a smaller two-dimensional matrix with a size limited to the appropriate dimensions of a matrix processor. In some embodiments, the sliced matrix is a smaller matrix with addresses to elements referencing the original matrix. The sliced matrix may be serialized into a vector for processing. In some embodiments, different slices of the matrix may overlap with previous slices. In various embodiments, matrices may be sliced only at boundaries corresponding to multiples of the read buffer size. For example, in the event each read buffer is 8-bytes in size, each row of a sliced matrix must begin at an address that is a multiple of eight. In the event a matrix fits within the computational array, no slicing is required (i.e., the matrix slice used is simply the original matrix).
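
A simplified software sketch of the slicing follows (assuming row-major storage, 1-byte elements, and matrix dimensions that are multiples of the 8-byte read buffer; padding and overlapping slices are omitted):

    # Sketch: slice a large matrix into 8x8 tiles whose row starts are
    # multiples of the 8-byte read buffer, per the alignment rule above.
    TILE = 8  # matrix processor dimension; also the read buffer size

    def slice_matrix(matrix):
        """Yield (row_offset, col_offset, tile) for each 8x8 slice."""
        rows, cols = len(matrix), len(matrix[0])
        for r in range(0, rows, TILE):
            for c in range(0, cols, TILE):  # column offsets: multiples of 8
                tile = [row[c:c + TILE] for row in matrix[r:r + TILE]]
                yield r, c, tile

    big = [[r * 16 + c for c in range(16)] for r in range(16)]
    tiles = list(slice_matrix(big))   # four 8x8 slices of a 16x16 matrix
    assert len(tiles) == 4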

At 605, matrix operations are distributed and performed. For example, the matrix operations corresponding to the matrix operations and operands determined at 603 are distributed to one or more node engines and to one or more matrix processors of the node engines. In various embodiments, the matrix operations are performed by one or more matrix processors using 8-bit element matrices. The values for the elements of the matrix results are accumulated into 21-bit, 27-bit, or another appropriate floating-point format. In various embodiments, the matrix results can be moved out of the matrix processor in one of several formats including 8-bit, 16-bit, and 32-bit floating-point formats. In various embodiments, each node engine can perform multiple matrix operations in parallel by utilizing multiple matrix processors.

In some embodiments, references to the matrix operands are distributed along with the operations to a node engine. In this manner, the node engine can perform a data read to load the corresponding elements of the sliced matrices. In some embodiments, the node engine will linearize a sliced matrix for loading into memory and/or a register where the input matrix can then be sent to a matrix processor. In some embodiments, a control unit of the node engine coordinates the scheduling, issuing, and synchronization of operations including the loading of sliced matrix operands (including addressing specified strides, paddings, and other parameters of the matrix operands) and the operation of the matrix processors. Once a matrix operation is issued to a matrix processor, the matrix processor will take a certain number of clock cycles to complete the matrix operation. In some embodiments, the matrix processor performs matrix operations using the processes of FIGS. 7 and/or 8.

At 607, post-processing is performed. In some embodiments, post-processing may be performed by node engines and may include additional vector operations performed after the completion of a matrix operation. Post-processing operations can be performed by a post-processing unit, such as a vector processor or vector computational unit, of the node engine. In some embodiments, vector post-processing includes performing complex operations such as arithmetic operations, scaling, normalization, and/or the application of an activation function such as a rectified linear unit (ReLU) function on each element of a vector. In some embodiments, the elements of the vector may be converted/formatted to 8-bit, 16-bit, or 32-bit elements depending on the precision needed. In various embodiments, the results of the distributed matrix operations by each node engine may be sent back to or redirected by the training platform server and used for further processing. For example, the results of matrix operations distributed and performed at 605 may be combined and utilized as operands for additional vector or matrix operations. After post-processing is initiated at 607, processing loops back to 601 to receive additional computational instructions. In some embodiments, post-processing does not need to complete before processing loops back to 601 for additional computational instructions.

FIG. 7 is a flow diagram illustrating an embodiment of a process for performing matrix computations. The process of FIG. 7 is used by a matrix processor such as matrix processors 313 and 351-357 of node engine 300 of FIG. 3 to perform matrix computations. In some embodiments, each matrix processor of a node engine can perform the process of FIG. 7 in parallel. For example, matrix processors 313 and 351-357 each perform the process of FIG. 7 in parallel on different matrix arguments, although each may be at a different step of processing to stagger the completion of their respective operations. In some embodiments, the process is utilized to perform a convolution using a data matrix and a weight matrix. In some scenarios, the input matrices are slices of larger matrices. In various embodiments, the process of FIG. 7 may be initiated by a matrix computation instruction via a control unit. The instruction may specify the two matrix operands (e.g., the memory or register locations of a data and a weight matrix), a configurable bias, a floating-point format, and a designated accumulator to store the matrix computation result. In some embodiments, the designated accumulator is zeroed out before the matrix computation begins. In some embodiments, the designated accumulator is output accumulator 329 or 331 of FIG. 3. In some embodiments, the process of FIG. 7 is performed at 605 of FIG. 6.

At 701, a data input matrix is received. For example, elements of a data input matrix corresponding to training sensor data are linearized and stored in a data input array of a matrix processor. In some embodiments, a data input matrix is stored in a data input array, such as data input array 321 of matrix processor 313 of FIG. 3. Each data input array is capable of storing an entire linearized matrix for the corresponding matrix processor to be processed by the matrix computational unit. Thus a matrix processor capable of multiplying two 8×8 matrices uses a data input array capable of storing all 64 elements of an input 8×8 data matrix. For example, in some embodiments, each data input array is 64 bytes and stores each element as an 8-bit floating-point number. The format for the floating-point number may use format 400 or 410 of FIG. 4 and include a configurable bias. The configurable bias may be specified by a matrix instruction and/or by a register. The received data input matrix may be received from a register or from memory, such as SRAM. In some embodiments, one or more reads are issued to load the entire data input matrix to the matrix processor but the entire matrix is not available at once. For example, for a sliced matrix, data for some rows (or columns) may require additional delay before the data is available. Thus the data for the data input array might arrive piecemeal. In some embodiments, a single read is sufficient to load the entire data input matrix. In some embodiments, the data input matrix is a gradient input matrix.

At 703, a weight input matrix is received. For example, elements of a weight input matrix corresponding to machine learning weights of a filter are linearized and stored in a weight input array of a matrix processor. In some embodiments, a weight input matrix is stored in a weight input array, such as weight input array 323 of matrix processor 313 of FIG. 3. Each weight input array is capable of storing an entire linearized matrix for the corresponding matrix processor to be processed by the matrix computational unit. Thus a matrix processor capable of multiplying two 8×8 matrices uses a weight input array capable of storing all 64 elements of an input 8×8 weight matrix. For example, in some embodiments, each weight input array is 64 bytes and stores each element as an 8-bit floating-point number. The format for the floating-point number may use format 400 or 410 of FIG. 4 and include a configurable bias. The configurable bias may be specified by a matrix instruction and/or by a register. The received weight input matrix may be received from a register or from memory, such as SRAM. In some embodiments, one or more reads are issued to load the entire weight input matrix to the matrix processor but the entire matrix is not available at once. For example, for a sliced matrix, weight data for some rows (or columns) may require additional delay before the weight data is available. Thus the weight data for the weight input array might arrive piecemeal. In some embodiments, a single read is sufficient to load the entire weight input matrix. In some embodiments, the weight input matrix is a gradient input matrix.

At 705, a pair of vector arguments is loaded into the matrix computational unit. A vector corresponding to a column of the weight input matrix and a vector corresponding to a row of the data input matrix are loaded as input arguments into the matrix computational unit, such as matrix computational unit 325 of FIG. 3. As part of the loading process, the column vector is duplicated across the entire matrix computation unit and the row vector is duplicated down the entire matrix computation unit. For example, an entire vector corresponding to a column of the weight input matrix is loaded into the computational unit. Each element of the column vector is duplicated across an entire row. Thus each column of an 8×8 matrix computational unit receives the same 8-element column vector and the value loaded to each cell of a row of the matrix computation unit is the same. Similarly, an entire vector corresponding to a row of the data input matrix is loaded into the computational unit and each element of the row vector is duplicated down an entire column. Thus each row of an 8×8 matrix computational unit receives the same 8-element row vector and the value loaded to each cell of a column of the matrix computation unit is the same. For an 8×8 matrix computational unit, one eighth of the input matrix elements is loaded each pass through step 705. At 705, an unloaded pair of vectors, one from each input matrix, is loaded into the matrix computational unit. Each subsequent loop through step 705 loads the next available column and row from the input weight and data matrices. Thus, an 8×8 matrix requires at least 8 cycles to complete loading whereas a 4×4 matrix requires at least 4 cycles to complete loading.

At 707, values of the loaded vectors are multiplied. For each computational cell (such as computational cell 327 of FIG. 3) of the matrix computational unit, a multiplication is performed using the elements loaded at the corresponding computational cell. In various embodiments, the multiplication is performed on two 8-bit floating-point values and stored as a higher-bit floating-point value to prevent overflow and to maintain precision. In some embodiments, the higher-bit floating-point format is the 21-bit floating-point format of FIG. 5. In some embodiments, the higher-bit floating-point format is a 27-bit floating-point format to further reduce the loss of accuracy in the quantized result. For an 8×8 matrix computational unit, each of the 64 computational cells performs a multiplication.

At 709, multiplication results are accumulated into a designated accumulator. For example, the multiplication results of each computational cell at 707 are each accumulated into one of the accumulators of the matrix processor. In some embodiments, a matrix processor includes more than one accumulator, such as the two output accumulators 329 and 331 of FIG. 3. This is beneficial so that the matrix processor can interleave different matrix operations. In some embodiments, each computational cell includes an accumulator that adds the current value of the element in the accumulator corresponding to that computational cell to the result of the cell's multiplication. In various embodiments, the accumulator is sized to store an accumulation result for each element of the matrix. Thus each accumulator of an 8×8 matrix computational unit has at least 64 elements. In some embodiments, similar to the result of the multiplication at 707, the elements of the accumulator use a higher-bit floating-point value than the input to the matrix processor to prevent overflow and to maintain precision. In some embodiments, the higher-bit floating-point format is the 21-bit floating-point format of FIG. 5 or another higher-bit floating-point format. In some embodiments, an accumulator for an 8×8 matrix computational unit is 168-bytes to allow for 64 elements, each storing a 21-bit floating-point number.
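
The loop of steps 705, 707, and 709 can be modeled in software as a sum of outer products (a behavioral sketch only; the hardware performs the per-cell multiplies in parallel rather than with nested loops):

    # Sketch of the 705-707-709 loop for an NxN computational unit: each
    # cycle loads one weight column and one data row, duplicates them
    # across the array, multiplies per cell, and accumulates. Summing
    # the N outer products reproduces the matrix product W @ D.
    N = 8

    def matrix_multiply(weights, data):
        acc = [[0.0] * N for _ in range(N)]  # the designated accumulator
        for k in range(N):                   # one pair of vectors per cycle
            col = [weights[i][k] for i in range(N)]  # duplicated across rows
            row = data[k]                            # duplicated down columns
            for i in range(N):
                for j in range(N):
                    acc[i][j] += col[i] * row[j]  # per-cell multiply-add
        return acc

    W = [[float(i == j) for j in range(N)] for i in range(N)]  # identity
    D = [[float(i * N + j) for j in range(N)] for i in range(N)]
    assert matrix_multiply(W, D) == D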

At 711, a determination is made whether there are additional vectors remaining for the matrix operation. For example, in order to multiply two matrices, at most one column from the weight input matrix and one row from the data input matrix are loaded for each clock cycle. To complete the entire matrix multiplication, every column and every row must be loaded. An 8×8 matrix requires at least 8 cycles to completely load both input matrices into the matrix computational unit. Similarly, a 4×4 matrix requires at least 4 cycles to completely load both input matrices into the matrix computational unit. In the event there are additional vectors remaining to be loaded, processing continues back to 705. In the event no additional vectors remain to be loaded (both entire input matrices have been loaded), the matrix multiplication is complete and processing continues to 713.

At 713, a matrix result is loaded into an output array from the designated accumulator. Since the matrix computation is complete, the matrix result is stored in the designated accumulator. In some embodiments, the elements of the matrix are stored in the designated accumulator as 21-bit floating-point values. Thus for an 8×8 matrix, the accumulator stores 64 values and is 168 bytes in size. In some embodiments, multiple move operations are needed to move the result from the accumulator to an output array, such as output array 315 of FIG. 3. In some embodiments, the output array and bus to the output array are 64-bytes wide. The accumulator results are converted from 21-bit floating-point values into 16-bit floating-point values that can be stored in two 64-byte components. Using the 8×8 result matrix as an example, two move operations are needed to move the results from the accumulator of the matrix processor. For example, a move high operation is used to move the high bits of the accumulator (corresponding to 32 elements of the matrix) into the 64-byte output array as 16-bit floating-point values. Once moved into the output array, the 32 elements can be stored in a register, such as one of the registers of post-processing unit register file 307 of FIG. 3, or moved to memory. Subsequently, a move low operation is used to move the low bits of the accumulator (corresponding to the remaining 32 elements of the matrix) into the 64-byte output array as 16-bit floating-point values. Once in the output array, the remaining 32 elements can also be stored in a register. In various embodiments, two or more operations are needed to move the matrix results out of the matrix processor. By converting the 21-bit floating-point values to 16-bit floating-point values, only two move operations are needed. In some embodiments, the values can be moved out as 8-bit, 16-bit, or 32-bit floating-point values. In the example described, the values are moved out as 16-bit values for later processing by a post-processing unit such as post-processing unit 317 of FIG. 3. In some embodiments, the post-processing unit is a vector computational engine. In various embodiments, the output array is connected to accumulators of each matrix processor of the node engine and acts as a multiplexer to receive the results of moves (e.g., high and low move instructions) from the different matrix processors.
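
The two-move retrieval can be sketched as follows (a software stand-in only: the rounding helper approximates the 21-bit to 16-bit conversion, and which half of the accumulator each move covers is an assumption):

    import math

    # Sketch of the two-move retrieval at 713: the 64-element accumulator
    # holds wide values; each move converts 32 of them to a 16-bit format
    # (modeled as rounding to a 7-bit mantissa) to fill the 64-byte array.
    def to_16bit(x, mantissa_bits=7):
        """Round x to the given mantissa precision (stand-in conversion)."""
        if x == 0.0:
            return 0.0
        exp = math.floor(math.log2(abs(x)))
        scale = 2.0 ** (exp - mantissa_bits)
        return round(x / scale) * scale

    def move_high(accumulator):
        """Move the highest 32 elements into the 64-byte output array."""
        return [to_16bit(v) for v in accumulator[32:]]

    def move_low(accumulator):
        """Move the lowest 32 elements into the 64-byte output array."""
        return [to_16bit(v) for v in accumulator[:32]]

    acc = [i * 0.25 for i in range(64)]        # a finished 8x8 result
    result = move_low(acc) + move_high(acc)    # two moves, 64 elements
    assert len(result) == 64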

FIG. 8 is a flow diagram illustrating an embodiment of a process for performing multiple interleaved matrix computations. The process of FIG. 8 is used by a matrix processor such as matrix processors 313 and 351-357 of node engine 300 of FIG. 3 to interleave multiple matrix computations such as two matrix multiplication operations. Each of the interleaved matrix computations may be implemented using multiple intermediate matrix multiplications, with the results of the intermediate multiplications being used to compute the larger matrix computation. To improve the processing bandwidth and efficiency, the result of each intermediate matrix multiplication is stored in the matrix processor and not cleared when interleaving an alternate matrix operation. The different matrix operations can be distinct and each have non-overlapping matrix operands.

In some embodiments, each matrix processor of a node engine can process more than one matrix operation at a time, one matrix operation corresponding to each output accumulator of a matrix processor. In some embodiments, the ability to interleave multiple matrix operations allows matrix multiplication operations on very large matrices to be performed. The larger matrices are sliced into smaller matrices that fit the input array of the matrix processor and the results of matrix multiplications of the smaller matrices are combined. In various embodiments, the ability to interleave multiple matrix operations increases the bandwidth and performance of the matrix processor by utilizing the matrix computational unit, for example, while waiting for memory reads to complete. Thus, when input operands for a pending matrix operation of a first set of related matrix operations are not available (e.g., due to the latency of a memory read) but the input operands for a pending matrix operation of a second set of related matrix operations are available, the second set of related matrix operations can utilize the matrix computational unit. By utilizing multiple accumulators, the matrix computational unit can switch between multiple matrix computations by storing intermediate results in accumulators dedicated to particular sets of related matrix operations. In some embodiments, the data input array is data input array 321 of FIG. 3, the weight input array is weight input array 323 of FIG. 3, and the multiple accumulators are output accumulators 329 and 331 of FIG. 3. Although two accumulators are shown with respect to matrix processor 313 of FIG. 3, additional accumulators may be included to allow additional matrix operations to be interleaved.
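
The interleaving policy can be sketched at a high level as follows (a behavioral model only; the ready flags and scalar "chunks" stand in for memory reads and vector loads, and each operation accumulates into its own dedicated accumulator):

    # Sketch of interleaving two matrix operations: when the current
    # operation's next chunk of operand data is not ready (a stalled
    # memory read), switch to the alternate operation instead of idling.
    def run_interleaved(op_a, op_b):
        """Each op is a list of (ready, value) chunks; accumulate the
        values of whichever operation has data available."""
        ops = [list(op_a), list(op_b)]
        accumulators = [0.0, 0.0]      # one dedicated accumulator per op
        current = 0
        while ops[0] or ops[1]:
            chunks = ops[current]
            if chunks and chunks[0][0]:           # data ready: make progress
                accumulators[current] += chunks.pop(0)[1]
            elif ops[1 - current]:                # else switch, if pending
                current = 1 - current
            elif chunks:                          # lone op stalled: model the
                chunks[0] = (True, chunks[0][1])  # read eventually completing
        return accumulators

    a = [(True, 1.0), (False, 2.0)]   # second chunk stalls on a read
    b = [(True, 10.0), (True, 20.0)]
    assert run_interleaved(a, b) == [3.0, 30.0]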

The process of FIG. 8 is a specialized variation of the process of FIG. 7 that utilizes multiple weight input array operands, multiple data input array operands, and multiple output accumulators to support interleaving two matrix multiplication operations. As described with respect to FIG. 7, the process of FIG. 8 similarly implements the steps of FIG. 7, including the loading of a column vector across the matrix computation unit, the loading of a row vector down the matrix computation unit, the multiplication of operands by computational cells, and the accumulation of the multiplication results in a designated accumulator, but takes care to not intermingle or wipe the intermediate results of the two interleaved matrix operations. In some embodiments, the process of FIG. 8 is performed at 605 of FIG. 6.

At 801, a determination is made whether the matrix processor can receive an additional matrix operation instruction. In the example of FIG. 8, the matrix processor is capable of interleaving two matrix operations. A determination is made whether there are currently two matrix operations in the process of being performed. In the event the matrix processor can receive an additional matrix operation instruction, processing continues to 803. For example, the matrix processor can receive an additional matrix operation instruction since it is in the middle of processing only a single matrix operation or is idle and not processing any matrix operations. In the event the matrix processor cannot receive an additional matrix operation instruction, processing loops back to 801 until the matrix processor is available to receive a new matrix operation instruction. For example, the matrix processor is currently in the middle of processing two matrix operations and cannot receive another operation until at least one of the current operations completes. In some embodiments, a ready signal is issued to a control unit to signal that the matrix processor is ready to receive additional instructions.

At 803, the matrix processor receives a matrix instruction and issues read requests for the associated matrix operands. For example, a matrix processor receives a matrix multiply instruction with two operands corresponding to two input matrices. Reads are issued for the values of the matrix operands. The values may be read from a register and/or memory. For example, the matrix arguments may specify a register and/or an address in memory. In some embodiments, a memory read may stall the matrix computation since a memory read may take multiple clock cycles for the data to be available. In some embodiments, multiple memory reads may be issued since the matrix is not stored sequentially in memory. This may be a result of a larger matrix being sliced into a smaller matrix operand.

In some embodiments, the instruction received specifies a particular accumulator to store the matrix result. In order to interleave multiple matrix operations, each operation utilizes its own accumulator. The designated accumulator is used to store the intermediate and final matrix results. In some embodiments, the designated accumulator stores intermediate results using a higher-bit floating-point format than the format used for input operands. The higher-bit format minimizes the loss of accuracy when results are quantized.

In various embodiments, when the data corresponding to the matrix operands is available, the values are received and prepared for the matrix processor. In some embodiments, the matrix operands are too large for the matrix processor and multiple intermediate matrix operations are performed to complete the matrix instruction. In the event data is not available, the matrix computational unit may stall and be idle. Instead of remaining idle, a second matrix operation may be performed as long as data for the second operation is available.

At 803, processing continues to both 801 and 805. Processing loops back to 801 to fetch new instructions while also simultaneously continuing to 805 to execute the instruction received at 803. In various embodiments, the fetching of new instructions happens in parallel with the processing of the current matrix operations. In some embodiments, the two processing branches to 801 and 805 are implemented using a pipeline-based approach.

At 805, a determination is made whether data is ready for the current matrix operation. For example, the elements to be loaded from the matrix operands of the current matrix operation must be available to be loaded to the computational cells of the matrix computational unit. In some embodiments, the data loaded into the matrix computational unit are slices of the matrix operands that are sized for the input arrays of the matrix computational unit. For the weight input array, the pending columns of elements must be ready. For the data input array, the pending rows of elements must be ready. In the event the elements of the weight columns and data rows for the current matrix operation are available, processing continues to 807. In the event the pending elements for the current matrix operation are not available, processing continues to 813. For example, the pending elements may not be available due to the latency from a memory read and/or a cache miss. Instead of stalling while waiting for the data to become available, the matrix computation unit may potentially be utilized for an alternative matrix operation.

At 807, the values from the weight columns and data rows for the current matrix operation are loaded to corresponding computational cells, compute operations are performed on the values, and the compute result is accumulated into the designated accumulator. In some embodiments, the compute operations are multiply operations corresponding to multiplying elements from two different matrices. In some embodiments, the process at 807 is described with respect to steps 701, 703, 705, 707, 709, and/or 711 of FIG. 7. For example, the values are loaded as 8-bit floating-point values with a configurable bias. The result of the computation, such as a multiplication, and the accumulation is stored in a 21-bit floating-point format in the first accumulator. In some scenarios, additional configuration related to the matrix operation is performed at 807, such as clearing the accumulator, determining a floating-point format, and/or determining a configurable bias for a floating-point format, among others.

At 809, a determination is made whether the matrix instruction for the current matrix operation is complete. In the event the matrix instruction is complete, processing continues to 811. In the event the matrix instruction is not complete, processing continues to 805 where a determination is made whether additional data for the current matrix operation is ready to be loaded and processed by the matrix computational unit. In some embodiments, the process at 809 is described with respect to step 711 of FIG. 7.

In some alternative embodiments (not shown), in the event the matrix instruction is not complete, processing continues to 813 where a determination is made whether an alternate matrix operation is pending and whether data for the pending alternate matrix operation is ready to be loaded and processed by the matrix computational unit. Under this alternative embodiment, instead of running the current matrix operation to completion, the matrix computational unit continuously alternates back and forth between the two different matrix operations, as long as there are two concurrent matrix operations with available data.

At 811, the matrix result stored in the designated accumulator is loaded into an output array. Since some embodiments store the resulting matrix using a higher bit-depth floating-point format, such as a 21-bit or 27-bit floating-point format, moving the result out of the matrix processor may require multiple move instructions. In some embodiments, the matrix result is moved into two 64-byte registers via an output array by first converting the matrix elements into 16-bit floating-point values. In some embodiments, the process at 811 is described with respect to step 713 of FIG. 7. Processing loops back to step 805 where the matrix processor is ready to begin a matrix operation or to make progress on an alternate matrix operation, if pending.

In some alternative embodiments (shown as a dotted line), processing continues to 813 where a determination is made whether an alternate matrix operation is pending and whether data for the pending alternate matrix operation is ready to be loaded and processed by the matrix computational unit. Under this alternative embodiment, once the current matrix instruction is completed, the matrix computational unit switches to an alternate matrix operation in the event that there was an alternate matrix operation pending completion.

At 813, a determination is made whether an alternate matrix operation is pending and whether data for the pending alternate matrix operation is ready to be loaded and processed by the matrix computational unit. For example, in the event a second matrix operation is received at 803 while processing a first matrix operation, a second matrix operation pending completion will have issued reads for its corresponding matrix arguments. A determination is made whether there is a second alternate matrix operation pending and whether its data is ready to be loaded into the matrix computational unit. In the event the operand data for an alternate matrix operation is available, processing continues to 815. In some embodiments, the operand data are slices of larger operand matrices that are sized for the input arrays of the matrix computational unit. For the weight input array, the pending columns of elements must be ready. For the data input array, the pending rows of elements must be ready. In the event there is not a pending alternate matrix operation or the pending elements for the alternate matrix operation are not available, processing continues to 805. For example, the pending elements may not be available due to the latency from a memory read and/or a cache miss. Instead of stalling while waiting for the data to become available, the availability of the data corresponding to the current matrix operation is checked again. The first matrix operation with available data will have its data loaded into the matrix computational unit for processing.

At 815, the matrix processor, including the matrix computation unit, is switched to perform processing on the alternate matrix operation that is pending completion. The alternate matrix operation is now designated as the current matrix operation and the previously current matrix operation is designated as the alternate matrix operation. Since the first matrix operation may have stalled (or, in some embodiments, completed), the matrix computational unit will now work on the second matrix operation that was pending completion. In various embodiments, the corresponding output accumulator is designated, as appropriate, as the source for previous intermediate results and the destination for accumulating intermediate and final results. Processing continues to 807 where computation progress is made on the newly designated current matrix operation.

Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.

1-20. (canceled)
21. A microprocessor system, comprising: a plurality of matrix processors which form a node engine; a control unit configured to provide matrix processor instructions to the matrix processors; and a plurality of processor registers allocated to store a plurality of bit depth formats, wherein the node engine is configured to execute a plurality of matrix processor instructions, wherein operations associated with the matrix processor instructions are interleaved for execution by the matrix processors which form the node engine, and wherein the matrix processors are configured to execute the operations in different bit depth formats based on the associated matrix processor instructions.
22. The microprocessor system of claim 21, wherein a first bit depth format is an 8-bit floating point format, and wherein input data associated with a particular matrix processor instruction is formatted according to the first bit depth format.
23. The microprocessor system of claim 22, wherein a second bit depth format is a 21-bit floating point format, and wherein intermediate results associated with the particular matrix processor instruction are formatted according to the second bit depth format.
24. The microprocessor system of claim 21, wherein a particular matrix processor instruction specifies a configurable exponent bias.
25. The microprocessor system of claim 21, wherein the matrix processors interleave the operations based on data availability.
26. The microprocessor system of claim 21, wherein each of the plurality of matrix processors includes a plurality of accumulators.
27. The microprocessor system of claim 26, wherein each matrix processor uses different accumulators for operations associated with different matrix processor instructions.
28. The microprocessor system of claim 27, wherein for a particular matrix processor, the different accumulators are designated to accumulate and store intermediate results associated with the different matrix processor instructions.
29. The microprocessor system of claim 28, wherein the different accumulators store the intermediate results using a first bit depth format with a higher-bit floating point format than a second bit depth format used for input operands.
30. The microprocessor system of claim 26, wherein a particular matrix processor instruction specifies a particular accumulator for operations associated with the particular matrix processor instruction.
31. A microprocessor system, comprising: a matrix processor which forms at least part of a node engine; a control unit configured to provide matrix processor instructions to the matrix processor; and a plurality of processor registers allocated to store a plurality of bit depth formats, wherein the matrix processor is configured to execute, at least in part, a plurality of matrix processor instructions, wherein the matrix processor interleaves execution of a subset of operations associated with the matrix processor instructions, and wherein the matrix processor is configured to execute the subset of the operations in different bit depth formats based on the associated matrix processor instructions.
32. The microprocessor system of claim 31, wherein the matrix processor interleaves the operations based on data availability.
33. The microprocessor system of claim 31, wherein the matrix processor interleaves execution of the subset based on use of a plurality of accumulators.
34. The microprocessor system of claim 31, wherein a particular matrix processor instruction specifies use of a particular accumulator of a plurality of accumulators included in the matrix processor for operations associated with the particular matrix processor instruction.
35. The microprocessor system of claim 34, wherein the particular accumulator stores intermediate results using a first bit depth format with a higher-bit floating point format than a second bit depth format used for input operands.
36. The microprocessor system of claim 31, wherein a particular matrix processor instruction specifies a first floating-point matrix operand and a second floating-point matrix operand, and wherein the first and second floating-point matrix operands are formatted using a first bit depth format.
37. The microprocessor system of claim 31, wherein a particular matrix processor instruction specifies a configurable exponent bias.
38. A method implemented by a node engine comprising a plurality of matrix processors, the method comprising: receiving, from a control unit of the node engine, a plurality of matrix processor instructions; and executing, via the node engine, the matrix processor instructions, wherein operations associated with the matrix processor instructions are interleaved for execution by the matrix processors which form the node engine, and wherein the matrix processors are configured to execute the operations in different bit depth formats based on the associated matrix processor instructions.
39. The method of claim 38, wherein each matrix processor includes a plurality of accumulators, and wherein the matrix processors interleave execution of the subset based on use of accumulators for respective matrix processor instructions.
40. The method of claim 39, wherein for each matrix processor, a particular accumulator of the plurality of accumulators stores intermediate results using a first bit depth format with a higher-bit floating point format than a second bit depth format used for input operands.